1 Client Bio

The United Nations (UN) is an international organisation committed to maintaining international peace and security and to promoting social progress and human rights (United Nations, 2017). The organisation actively promotes educational and economic development as key elements of its sustainable development goals. Website: https://www.un.org/en

2 Recommendation

  • Statistical analysis reveals a statistically significant positive association between educational attainment and economic prosperity: countries with more mean years of schooling tend to have a higher Gross National Income (GNI) per capita.

  • This report recommends that the United Nations intensify efforts to extend mean years of schooling globally, particularly in underperforming regions. Such a strategy not only aligns with the UN’s Sustainable Development Goals 4 and 8 (THE 17 GOALS | Sustainable Development, 2015) but also promises significant economic uplift.

  • Implementing this recommendation will advance global economic stability and reduce inequalities, reinforcing the UN’s commitment to sustainable development.

3 Evidence

3.1 Initial Data Analysis (IDA)

3.1.1 Source:

The dataset used is titled “Average global IQ per country with other stats”. It was collected and formatted by Matheus Felipe and published on Kaggle (https://www.kaggle.com/datasets/mlippo/average-global-iq-per-country-with-other-stats?resource=download).

3.1.2 Limitation:

There are many rows with missing data in mean years of schooling and GNI which may lead to inaccurate and non-inclusive results.
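
One way to quantify this limitation is to count the missing values in each column. The sketch below uses a small made-up data frame (`toy`, with illustrative values only) in place of the real dataset, with the short column names adopted in section 3.1.4.

```r
# Toy stand-in for the dataset (values are illustrative, not real data)
toy <- data.frame(
  country         = c("A", "B", "C", "D"),
  meanschoolyears = c(13.4, NA, 11.9, 7.6),
  gni             = c(42274, NA, 90919, NA)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Number of rows usable for an analysis of gni against meanschoolyears
usable_rows <- sum(complete.cases(toy[, c("meanschoolyears", "gni")]))
print(usable_rows)
```

Applied to the real data frame, the same two calls would show how many countries drop out of each analysis.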

3.1.3 Structure:

The dataset consists of 193 rows, each representing a country or small territory. There are 10 columns, one per variable:

  • Rank

  • Country

  • Average IQ

  • Continent

  • Literacy Rate

  • Nobel Prizes

  • HDI

  • Mean years of schooling: Mean years of education that a country’s citizens receive.

  • GNI: Gross National Income per capita of that country.

  • Population

3.1.4 Data Cleaning:

The read.csv command infers the type of each column when the dataset is imported. The original column labels were replaced with short and precise names for convenience and reusability.

options(repos = c(CRAN = "https://cloud.r-project.org/"))
data = read.csv("~/Desktop/DATA1001/avgIQpercountry.csv")
colnames(data) = c(
  "rank",
  "country",
  "averageiq",
  "continent",
  "literacyrate",
  "nobelprizes",
  "hdi",
  "meanschoolyears",
  "gni",
  "population"
)
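
The `str()` output in section 3.1.5 reports `population` as a character column. It is not used in this analysis, but if it were needed as a number it could be converted along the following lines. The sketch assumes the cause is thousands separators in some entries, which the source does not confirm. The example vector is made up.

```r
# Hypothetical population strings; the comma-separated entry is an assumption
population_raw <- c("123294513", "10143543", "6,014,723")

# Strip thousands separators, then convert to numeric
population_num <- as.numeric(gsub(",", "", population_raw, fixed = TRUE))
print(population_num)
```

Any entry that still fails to parse would become `NA` with a warning, which makes remaining problem values easy to locate.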

3.1.5 R’s Summary of Data:

# Quick look at the first 6 rows of data
head(data)
##   rank     country averageiq continent literacyrate nobelprizes   hdi
## 1    1       Japan    106.48      Asia         0.99          29 0.925
## 2    2      Taiwan    106.47      Asia         0.96           4    NA
## 3    3   Singapore    105.89      Asia         0.97           0 0.939
## 4    4   Hong Kong    105.37      Asia         0.94           1 0.952
## 5    5       China    104.10      Asia         0.96           8 0.768
## 6    6 South Korea    102.35      Asia         0.98           0 0.925
##   meanschoolyears   gni population
## 1            13.4 42274  123294513
## 2              NA    NA   10143543
## 3            11.9 90919    6014723
## 4            12.2 62607    7491609
## 5             7.6 17504 1425671352
## 6            12.5 44501   51784059
## Size of data
dim(data)
## [1] 193  10
## R's classification of data
class(data)
## [1] "data.frame"
## R's classification of variables
str(data)
## 'data.frame':    193 obs. of  10 variables:
##  $ rank           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country        : chr  "Japan" "Taiwan" "Singapore" "Hong Kong" ...
##  $ averageiq      : num  106 106 106 105 104 ...
##  $ continent      : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ literacyrate   : num  0.99 0.96 0.97 0.94 0.96 0.98 1 1 1 0.99 ...
##  $ nobelprizes    : int  29 4 0 1 8 0 2 5 0 111 ...
##  $ hdi            : num  0.925 NA 0.939 0.952 0.768 0.925 0.808 0.94 0.935 0.942 ...
##  $ meanschoolyears: num  13.4 NA 11.9 12.2 7.6 12.5 12.1 12.9 12.5 14.1 ...
##  $ gni            : int  42274 NA 90919 62607 17504 44501 18849 49452 146830 54534 ...
##  $ population     : chr  "123294513" "10143543" "6014723" "7491609" ...

3.2 Linear Regression Models

According to Marquez-Ramos and Mourelle (2019), education is important because it affects a country’s economic growth. In the IQ dataset, education is quantified as mean years of schooling and economic performance is measured by Gross National Income (GNI).

To clarify how mean years of schooling impacts economic outcomes in varied contexts, countries are divided into lower and higher GNI groups based on median GNI, reducing variability within groups and enabling more precise comparisons.

median_gni <- median(data$gni, na.rm = TRUE)  

lower_gni_group <- data[data$gni <= median_gni,]
higher_gni_group <- data[data$gni > median_gni,]
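
One detail of this split is worth noting. In base R, a logical index that contains `NA` returns an all-`NA` row for each missing value, so countries with no GNI value would appear as empty rows in both groups. `lm()` and `t.test()` drop such rows, so the results reported here should be unaffected, but the row counts of the two groups are inflated. A minimal sketch on made-up data, using `which()` to keep only the rows where the comparison is `TRUE`:

```r
# Toy data with one missing GNI value (illustrative only)
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  gni     = c(1000, 5000, NA, 20000, 40000)
)
median_gni <- median(toy$gni, na.rm = TRUE)

# Plain logical indexing keeps an all-NA row for country C
lower_with_na <- toy[toy$gni <= median_gni, ]

# which() discards the NA comparison, so C lands in neither group
lower_clean  <- toy[which(toy$gni <= median_gni), ]
higher_clean <- toy[which(toy$gni > median_gni), ]

print(nrow(lower_with_na))  # counts A, B and the all-NA row
print(nrow(lower_clean))    # counts A and B only
```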

To check whether a relationship between education and economic performance exists, and to visualise it, scatter plots are drawn for both the lower GNI and higher GNI groups.

library(tidyverse)
ggplot(lower_gni_group, aes(x = meanschoolyears, y = gni)) +
  geom_point() +
  geom_smooth(method="lm", colour = "#1b95e0", se = FALSE) +
  theme_classic() +
  ggtitle("Relationship of Mean School Years and GNI of countries with lower GNI") +
  xlab("Country's Mean Years of School") +
  ylab("Country's Gross National Income (GNI)")

library(tidyverse)
ggplot(higher_gni_group, aes(x = meanschoolyears, y = gni)) +
  geom_point() +
  geom_smooth(method="lm", colour = "#1b95e0", se = FALSE) +
  theme_classic() +
  ggtitle("Relationship of Mean School Years and GNI of countries with higher GNI") +
  xlab("Country's Mean Years of School") +
  ylab("Country's Gross National Income (GNI)")

The two scatter plots show that GNI tends to rise as a country’s mean years of schooling increases, indicating a positive relationship between the two variables. This is consistent with research showing that individuals’ income tends to increase as they receive more education (Clarke, 2022), and it supports the recommendation that higher education levels can lead to a higher national GNI.

Having observed a visible relationship between mean years of schooling and GNI, a linear model is fitted to each group to measure the relationship quantitatively, providing precise estimates and enabling tests of statistical significance.

low <- lm(gni ~ meanschoolyears, data = lower_gni_group)
summary(low)
## 
## Call:
## lm(formula = gni ~ meanschoolyears, data = lower_gni_group)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5480.1 -1986.6  -188.3  2025.1  5302.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -822.5      775.2  -1.061    0.292    
## meanschoolyears    953.4      107.6   8.863 7.87e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2626 on 88 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.4716, Adjusted R-squared:  0.4656 
## F-statistic: 78.55 on 1 and 88 DF,  p-value: 7.867e-14
high <- lm(gni ~ meanschoolyears, data = higher_gni_group)
summary(high)
## 
## Call:
## lm(formula = gni ~ meanschoolyears, data = higher_gni_group)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29554 -13388  -4527   6654 104273 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -26650      14640  -1.820   0.0721 .  
## meanschoolyears     5537       1275   4.341 3.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20470 on 87 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.178,  Adjusted R-squared:  0.1686 
## F-statistic: 18.85 on 1 and 87 DF,  p-value: 3.813e-05

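Both slope p-values (7.87e-14 for the lower GNI group and 3.81e-05 for the higher GNI group) are below 0.05, indicating a statistically significant positive relationship between mean years of schooling and GNI in both economic tiers. The fit is moderate in the lower GNI group (R-squared of 0.47) and weaker in the higher GNI group (R-squared of 0.18), so schooling explains only part of the variation in GNI. These models show that education is positively associated with economic performance across different levels of national income; on their own they do not establish causation.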
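
To make the fitted coefficients concrete, the reported estimates can be plugged into the regression equation, GNI = intercept + slope × mean years of schooling. The sketch below uses the coefficients from the two `summary()` outputs. The schooling values of 8 and 12 years are arbitrary illustrations, and `predict_gni` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients copied from the summary() output of the two models
coef_low  <- c(intercept = -822.5, slope = 953.4)
coef_high <- c(intercept = -26650, slope = 5537)

# Hypothetical helper: fitted GNI for a given number of schooling years
predict_gni <- function(coefs, years) {
  unname(coefs["intercept"] + coefs["slope"] * years)
}

pred_low  <- predict_gni(coef_low, 8)    # lower GNI group, 8 years of schooling
pred_high <- predict_gni(coef_high, 12)  # higher GNI group, 12 years of schooling
print(c(pred_low, pred_high))
```

Under these models, one additional year of schooling is associated with about 953 more units of GNI in the lower group and about 5,537 in the higher group.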

3.3 Bar Plot

A bar plot comparing the average of mean school years across GNI levels provides a clear visual way to highlight educational disparities between lower and higher GNI countries.

mean_school_years_low <- mean(lower_gni_group$meanschoolyears, na.rm = TRUE)
mean_school_years_high <- mean(higher_gni_group$meanschoolyears, na.rm = TRUE)

group_means <- data.frame(
  gni_level = c("Low", "High"),
  mean_education_level = c(mean_school_years_low, mean_school_years_high)
)

library(ggplot2)
ggplot(group_means, aes(x = gni_level, y = mean_education_level, fill = gni_level)) +
  geom_bar(stat = "identity", position = position_dodge(), show.legend = FALSE) +
  scale_fill_manual(values = c("Low" = "darkblue", "High" = "darkgreen")) +
  theme_minimal() +
  labs(
    title = "Comparison of Mean of Mean School Years by GNI Level",
    x = "GNI Level",
    y = "Mean School Years"
  )

The bar plot shows that higher GNI countries tend to have more years of schooling. According to Vegas (2020), wealthier nations invest more in education, which suggests that increasing educational opportunities in lower GNI countries could be a key strategy for economic development.
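
ggplot2 orders a character x-axis alphabetically, so the bars are drawn as High then Low. If the reverse order is preferred, the grouping column can be converted to a factor with explicit levels. A minimal sketch, using the two group means reported as sample estimates by the t-test in section 3.4:

```r
# Group means taken from the t-test output (sample estimates)
group_means <- data.frame(
  gni_level = c("Low", "High"),
  mean_education_level = c(6.731111, 11.351685)
)

# Fix the display order: Low first, then High
group_means$gni_level <- factor(group_means$gni_level, levels = c("Low", "High"))
print(levels(group_means$gni_level))
```

Passing this data frame to the existing `ggplot()` call would then place the Low bar before the High bar.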

3.4 Hypothesis Testing

H0 = There is no difference in the mean years of schooling between the lower GNI group and the higher GNI group.

H1 = There is a difference in the mean years of schooling between the lower GNI group and the higher GNI group.

A Welch Two Sample t-test is performed on the mean school years of the lower GNI group and the higher GNI group. As the p-value (< 2.2e-16) is smaller than 0.05, there is sufficient evidence to reject the null hypothesis. Therefore, there is a difference in mean years of schooling between the two GNI groups.

t.test(lower_gni_group$meanschoolyears, higher_gni_group$meanschoolyears, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  lower_gni_group$meanschoolyears and higher_gni_group$meanschoolyears
## t = -14.106, df = 154.54, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.267637 -3.973512
## sample estimates:
## mean of x mean of y 
##  6.731111 11.351685
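
For readers who want to see where the t statistic and the fractional degrees of freedom come from, the sketch below recomputes a Welch test by hand and compares it with `t.test()`. The two vectors are made-up stand-ins for the groups’ schooling years, not values from the dataset.

```r
# Made-up samples standing in for the two groups' schooling years
x <- c(5.1, 6.8, 7.4, 4.9, 8.2, 6.0)
y <- c(10.5, 12.1, 11.3, 12.8, 10.9)

# Squared standard error of each mean, without pooling the variances
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Compare with R's built-in Welch test
welch <- t.test(x, y, var.equal = FALSE)
print(c(t_manual, unname(welch$statistic)))
print(c(df_manual, unname(welch$parameter)))
```

The negative sign of the statistic in the report simply reflects that the lower GNI group is passed as the first argument and has the smaller mean.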

4 Acknowledgments

United Nations (2017). About Us | United Nations. United Nations. https://www.un.org/en/about-us

THE 17 GOALS | Sustainable Development. (2015). Un.org. https://sdgs.un.org/goals

Marquez-Ramos, L., & Mourelle, E. (2019). Education and economic growth: an empirical analysis of nonlinearities. Applied Economic Analysis, 27(79), 21–45. https://doi.org/10.1108/aea-06-2019-0005

Clarke, M. (2022). Income - Department of Education, Australian Government. Department of Education. https://www.education.gov.au/integrated-data-research/benefits-educational-attainment/income

Vegas, E. (2020, June 19). Investing in public education worldwide is now more important than ever. Brookings. https://www.brookings.edu/articles/investing-in-public-education-worldwide-is-now-more-important-than-ever/

5 Appendix

5.1 Client Choice

The United Nations was chosen as the client for this project’s recommendation because of its global influence and commitment to promoting quality education.

5.2 Statistical Analyses

Independence: The observations are independent, as each country appears only once in the dataset.

5.2.1 Hypothesis

The hypotheses are stated in section 3.4.

5.2.2 Assumptions

5.2.2.1 Linear Regression Models

  • Scatter plots: The linearity of the relationships can be seen in the scatter plots in section 3.2.

  • Residual plots: The points on both residual plots appear to scatter randomly around the horizontal axis, which indicates homoscedasticity.

Thus, the two assumptions are met for the linear regression models fitted in section 3.2.

library(ggplot2)
model_low <- lm(gni ~ meanschoolyears, data = lower_gni_group)  
residuals_df <- data.frame(
    resid = resid(model_low),
    fitted = fitted(model_low)
)

ggplot(residuals_df, aes(x = fitted, y = resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  
    labs(x = "Fitted Values", y = "Residuals", title = "Residual vs. Fitted Plot for Lower GNI group") +
    theme_minimal()

library(ggplot2)
model_high <- lm(gni ~ meanschoolyears, data = higher_gni_group)  
residuals_df <- data.frame(
    resid = resid(model_high),
    fitted = fitted(model_high)
)

ggplot(residuals_df, aes(x = fitted, y = resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  
    labs(x = "Fitted Values", y = "Residuals", title = "Residual vs. Fitted Plot for Higher GNI group") +
    theme_minimal()

5.2.2.2 Hypothesis Testing

Normality:

  • Box plot: The comparative box plots display roughly symmetrical distributions with no significant skewness or outliers. This suggests that the data are close to normally distributed.

  • QQ plots: The points on the QQ plots lie close to a straight line, supporting the conclusion that the data are approximately normally distributed.

  • Normality Check: As both sample sizes are larger than 30, the Central Limit Theorem implies that the sample means are approximately normally distributed.

qqnorm(lower_gni_group$meanschoolyears); qqline(lower_gni_group$meanschoolyears)

qqnorm(higher_gni_group$meanschoolyears); qqline(higher_gni_group$meanschoolyears)

boxplot(lower_gni_group$meanschoolyears, higher_gni_group$meanschoolyears, names=c("Lower GNI", "Higher GNI"), main="Mean School Years Comparative Boxplots")

paste('Sample size of lower GNI group:', sum(complete.cases(lower_gni_group)))
## [1] "Sample size of lower GNI group: 90"
paste('Sample size of higher GNI group:', sum(complete.cases(higher_gni_group)))
## [1] "Sample size of higher GNI group: 89"
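
Note that `complete.cases()` applied to a whole data frame counts only rows with no missing value in any column, so it could undercount the observations a test on a single variable actually uses. Counting the non-missing values of the tested variable is a more targeted check. A sketch on made-up data:

```r
# Toy data: row 2 lacks hdi but does have schooling years
toy <- data.frame(
  hdi             = c(0.925, NA, 0.939, 0.768),
  meanschoolyears = c(13.4, 12.0, NA, 7.6)
)

# Rows that are complete in every column
n_complete_rows <- sum(complete.cases(toy))

# Values a test on meanschoolyears alone would use
n_schooling <- sum(!is.na(toy$meanschoolyears))

print(c(n_complete_rows, n_schooling))
```

In this report the two counts appear to agree, since 90 and 89 match the degrees of freedom in the F test output (89 and 88).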

Equal Spread:

  • Variance Test: As the p-value (0.000131) is smaller than 0.05, the null hypothesis is rejected and the variances of the two groups are treated as unequal. Thus, a Welch Two Sample t-test is used in section 3.4 instead of the equal-variance two-sample t-test.

    H0: The variances of the groups are equal

    H1: The variances of the groups are not equal

var.test(lower_gni_group$meanschoolyears, higher_gni_group$meanschoolyears)
## 
##  F test to compare two variances
## 
## data:  lower_gni_group$meanschoolyears and higher_gni_group$meanschoolyears
## F = 2.2875, num df = 89, denom df = 88, p-value = 0.000131
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.503518 3.478428
## sample estimates:
## ratio of variances 
##           2.287468
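
The F statistic reported by `var.test()` is the ratio of the two sample variances, and the two-sided p-value comes from the F distribution with n1 - 1 and n2 - 1 degrees of freedom. The sketch below reproduces both by hand on made-up vectors, which stand in for the two groups and are not values from the dataset.

```r
# Made-up samples standing in for the two groups' schooling years
x <- c(5.1, 6.8, 7.4, 4.9, 8.2, 6.0)
y <- c(10.5, 12.1, 11.3, 12.8, 10.9)

# F statistic: ratio of the sample variances
f_manual <- var(x) / var(y)

# Two-sided p-value from the F distribution
p_lower  <- pf(f_manual, df1 = length(x) - 1, df2 = length(y) - 1)
p_manual <- 2 * min(p_lower, 1 - p_lower)

# Compare with R's built-in F test
ftest <- var.test(x, y)
print(c(f_manual, unname(ftest$statistic)))
print(c(p_manual, ftest$p.value))
```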

5.2.3 Test Statistic and P-value

With a t-value of -14.106 and a p-value < 2.2e-16, there is a statistically significant difference between mean school years of lower GNI and higher GNI groups.

5.2.4 Conclusion

Statistical conclusion: As p-value < 0.05, the null hypothesis is rejected. There is a difference in the mean years of schooling between the lower GNI group and the higher GNI group.

Scientific conclusion: The data suggest that a country’s mean years of schooling is associated with its GNI: countries in the higher GNI group average about 4.6 more years of schooling than those in the lower GNI group (11.35 versus 6.73).

5.3 Limitations

  • Temporal relevance: The HDI, GNI, and mean years of schooling figures were collected in 2021, so they may not reflect more recent changes, which potentially limits the applicability of this project’s findings.